{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook implements **solutions** presented in the [README](https://github.com/chauvinSimon/hmm_for_autonomous_driving/blob/master/README.md).\n",
    "\n",
    "Some questions are addressed using the [hmmlearn](https://hmmlearn.readthedocs.io/en/latest/index.html#) python package.\n",
    "- To install it on **Windows**, you may want to get an already [compiled version](https://www.lfd.uci.edu/~gohlke/pythonlibs/#hmmlearn)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from hmmlearn import hmm\n",
    "import numpy as np\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "seed = 93\n",
    "np.random.seed(seed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "states = [\"left_lane\", \"right_lane\"]\n",
    "observations = [\"low_speed\", \"high_speed\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "discrete_model = hmm.MultinomialHMM(n_components=2,\n",
    "                                    algorithm='viterbi',  # decoder algorithm.\n",
    "                                    random_state=seed,\n",
    "                                    n_iter=10,\n",
    "                                    tol=0.01  # EM convergence threshold (gain in log-likelihood)\n",
    "                                   )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [Q1](https://github.com/chauvinSimon/hmm_for_autonomous_driving/blob/master/README.md#q1) How to easily estimate the parameters of our HMM?\n",
    "Here are the results of the estimation procedure based on **counting from the labelled data**. These estimates constitute the **MLE**.\n",
    "\n",
    "This question gives the chance to mentiont the **stationary state distribution**.\n",
    "- Using **sampling**.\n",
    "- Using the **transition matrix**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "discrete_model.startprob_ = np.array(\n",
    "    [1/3, 2/3]\n",
    ")\n",
    "discrete_model.transmat_ = np.array([\n",
    "    [0.6, 0.4],  # P(state_t+1|state_t=state_0)\n",
    "    [0.2, 0.8]]  # P(state_t+1|state_t=state_1)\n",
    ")\n",
    "discrete_model.emissionprob_ = np.array(\n",
    "    [[0.4, 0.6],  # P(obs|state_0)\n",
    "     [0.8, 0.2]]  # P(obs|state_1)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stationary distributions from sampling\n",
    "Generate random samples from the HMM model, using the derived parameters.\n",
    "\n",
    "A **large sample** is drawn.\n",
    "- Counting occurences should lead to the **stationary state distributions**.\n",
    "- `1/3` vs `2/3`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Counter({1: 66134, 0: 33866})\n",
      "Counter({0: 66520, 1: 33480})\n"
     ]
    }
   ],
   "source": [
    "sample_obs, sample_states = discrete_model.sample(100000)\n",
    "print(Counter(sample_states.flatten()))\n",
    "print(Counter(sample_obs.flatten()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stationary distribution from the transition matrix\n",
    "The **stationary distribution** is a **left eigenvector** (as opposed to the usual right eigenvectors) of the **transition matrix**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.33333333 0.66666667]\n"
     ]
    }
   ],
   "source": [
    "# Compute the stationary distribution of states.\n",
    "eigvals, eigvecs = np.linalg.eig(discrete_model.transmat_.T)\n",
    "eigvec = np.real_if_close(eigvecs[:, np.argmax(eigvals)])\n",
    "\n",
    "# normalisation\n",
    "stat_distr = eigvec / eigvec.sum()\n",
    "print(stat_distr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [Q2](https://github.com/chauvinSimon/hmm_for_autonomous_driving/blob/master/README.md#q2) - Given a single speed observation, what is the probability for the car to be in each of the two lanes?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, two methods of `hmmlearn` are used:\n",
    "- The built-in [scoring](https://hmmlearn.readthedocs.io/en/latest/api.html#hmmlearn.hmm.MultinomialHMM.predict) method.\n",
    "- The built-in [decoding](https://hmmlearn.readthedocs.io/en/latest/api.html#hmmlearn.hmm.MultinomialHMM.decode) method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Posterior probabilities\n",
    "In the [README](README.md), the **posterior conditional probabilities** were computed using Bayes' rule. For instance:\n",
    "- p(`left_lane`|`low_speed`) = p(`low_speed`|`left_lane`) * p(`left_speed`) / p(`low_speed`) \n",
    "\n",
    "**Posterior probabilities** for each state can be computed with `predict_proba()` and corresponding most likely states with `predict()`:\n",
    "- P(`left lane`  | `high speed`) = `1/3` `*` `0.6` / `1/3` = `0.6`\n",
    "- P(`left lane`  | `low speed`)  = `1/3` `*` `0.4` / `2/3` = `0.2`\n",
    "- P(`right lane` | `high speed`) = `2/3` `*` `0.2` / `1/3` = `0.4`\n",
    "- P(`right lane` | `low speed`)  = `2/3` `*` `0.8` / `2/3` = `0.8`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p(state|'low_speed') = [[0.2 0.8]] => most likely state is 'right_lane'\n",
      "p(state|'high_speed') = [[0.6 0.4]] => most likely state is 'left_lane'\n"
     ]
    }
   ],
   "source": [
    "for i in range(len(observations)):\n",
    "    p = discrete_model.predict_proba(np.array([[i]]))\n",
    "    most_likely = discrete_model.predict(np.array([[i]]))\n",
    "    print(\"p(state|'{}') = {} => most likely state is '{}'\".format(\n",
    "        observations[i], p, states[int(most_likely[0])]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Joint probabilities\n",
    "A second approach is to compute the **join probabilities**, as used in the **Viterbi algorithm**:\n",
    "- p(`left_lane`,`low_speed`) = p(`low_speed`|`left_lane`) * p(`left_speed`)\n",
    "- This corresponds to **alpha(`right_lane`, `t=1`)** for observation [`low_speed`].\n",
    "\n",
    "Join probabilities:\n",
    "- P(`left lane`  , `high speed`) = `1/3` `*` `0.6` = `3/15`  # *highest alpha for* `obs` = [`high_speed`] \n",
    "- P(`right lane` , `high speed`) = `2/3` `*` `0.2` = `2/15`\n",
    "- P(`left lane`  , `low speed`)  = `1/3` `*` `0.4` = `2/15`\n",
    "- P(`right lane` , `low speed`)  = `2/3` `*` `0.8` = `8/15`  # *highest alpha for* `obs` = [`low_speed`] "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "right_lane -> low_speed\n",
      "prob = 0.5333333333333333\n",
      "---\n",
      "left_lane -> high_speed\n",
      "prob = 0.19999999999999998\n"
     ]
    }
   ],
   "source": [
    "# The Viterbi decoding algorithm uses the `alpha*` values.\n",
    "# And the first `alpha*` are `alpha` values.\n",
    "# Hence, the decoding method should return the state with the **highest joint probability**.\n",
    "obs_0 = np.array([[0]]).T\n",
    "obs_1 = np.array([[1]]).T\n",
    "# Log probability of the maximum likelihood path through the HMM\n",
    "logprob_0, state_0 = discrete_model.decode(obs_0)  # 8/15 = alpha[`low_speed`](`right_lane`)\n",
    "logprob_1, state_1 = discrete_model.decode(obs_1)  # 2/5 = alpha[`high_speed`](`left_lane`)\n",
    "\n",
    "print(\"{} -> {}\".format(states[int(state_0)], observations[int(obs_0)]))\n",
    "print(\"prob = {}\".format(np.exp(logprob_0)))\n",
    "print(\"---\")\n",
    "print(\"{} -> {}\".format(states[int(state_1)], observations[int(obs_1)]))\n",
    "print(\"prob = {}\".format(np.exp(logprob_1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [Q3](https://github.com/chauvinSimon/hmm_for_autonomous_driving/blob/master/README.md#q3) - What is the probability to observe a particular sequence of speed measurements?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The presented solution uses the [scoring](https://hmmlearn.readthedocs.io/en/latest/api.html#hmmlearn.hmm.MultinomialHMM.predict) method of `hmmlearn`.\n",
    "\n",
    "The **marginal probabilities of an observation sequence** can be found by:\n",
    "- **The sum over any column of the product `alpha` \\* `beta`** (done in th next question).\n",
    "- In `hmmlearn`, this is given by the `score()` method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### For a single observation:\n",
    "- P(`low speed`) = P(`left lane`, `low speed`) + P(`right lane`, `low speed`) = `2/15` + `8/15` = `2/3`\n",
    "- P(`high speed`) = P(`left lane`, `high speed`) + P(`right lane`, `high speed`) = `3/15` + `2/15` = `1/3`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p(low_speed) = 0.6666666666666667 \n",
      "p(high_speed) = 0.33333333333333337 \n",
      "---\n",
      "p(low_speed) = 0.6666666666666667\n",
      "   posterior p(state|low_speed) = [[0.2 0.8]]\n",
      "p(high_speed) = 0.33333333333333337\n",
      "   posterior p(state|high_speed) = [[0.6 0.4]]\n"
     ]
    }
   ],
   "source": [
    "# For a single observation\n",
    "# Compute the log probability under the model.\n",
    "for i in range(len(observations)):\n",
    "    p = np.exp(discrete_model.score(np.array([[i]])))\n",
    "    print(\"p({}) = {} \".format(observations[i], p))\n",
    "\n",
    "print(\"---\")\n",
    "# Compute the log probability under the model and compute posteriors.\n",
    "for i in range(len(observations)):\n",
    "    p = (discrete_model.score_samples(np.array([[i]])))\n",
    "    print(\"p({}) = {}\\n   posterior p(state|{}) = {}\".format(observations[i], np.exp(p[0]), observations[i],  p[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### For the example used in [README](https://github.com/chauvinSimon/hmm_for_autonomous_driving/blob/master/README.md) : [`low_speed`, `high_speed`, `low_speed`]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p(['low_speed', 'high_speed', 'low_speed']) = 0.13184 \n"
     ]
    }
   ],
   "source": [
    "obs_sequence = np.array([[0, 1, 0]]).T\n",
    "p = np.exp(discrete_model.score(obs_sequence))\n",
    "print(\"p({}) = {} \".format([observations[i] for i in obs_sequence.T[0]], p))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [Q4](https://github.com/chauvinSimon/hmm_for_autonomous_driving/blob/master/README.md#q4) - Given a sequence of speed observations, what is the most likely current lane?\n",
    "\n",
    "This question about **filtering** gives the chance to introduce the **Forward Algorithm** (and the **Backward Algorithm**).\n",
    "- First, the **Forward Algorithm** is implemented to build the `alpha` table.\n",
    "- Then, **filtering** can be completed by normalizing `alpha` values (using marginals).\n",
    "\n",
    "From the `alpha` and `beta` tables, **marginal probabilities** are computed.\n",
    "- For the **full observation sequence**.\n",
    "- For **sub-sequences** of the observation sequence.\n",
    "\n",
    "Finally, and as for the [README](README.md#q4), multiple inference techniques are presented.\n",
    "- **Filtering**\n",
    "- **Smoothing**\n",
    "- **Prediction**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_state = discrete_model.n_components\n",
    "start_prob = discrete_model.startprob_\n",
    "emit_prob = discrete_model.emissionprob_\n",
    "transmat_prob = discrete_model.transmat_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Forward algorithm\n",
    "\n",
    "Used to build the `alpha` table for an observation sequence.\n",
    "\n",
    "Initialization:\n",
    "- `alpha`(`i`, `t=0`) = p(`obs_0`, `lane_i`) = p(`obs_0`|`lane_i`) * p(`lane_i`)\n",
    "\n",
    "Recursion:\n",
    "- `alpha`(`i`,`t+1`) = [emission at `t+1`] * SUM[over `state_j`][transition `t`->`t+1`*`alpha`(`j`,`t`)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def forward_algo(obs_seq):\n",
    "    alpha = np.zeros((len(obs_seq), n_state))\n",
    "    alpha[0] = np.transpose(emit_prob)[obs_seq[0]] * start_prob\n",
    "    for t in range(alpha.shape[0]-1):\n",
    "        alpha[t+1] = np.transpose(emit_prob)[obs_seq[t+1]] * np.dot(alpha[t], transmat_prob)\n",
    "    return alpha"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.13333333 0.53333333]\n",
      " [0.112      0.096     ]\n",
      " [0.03456    0.09728   ]]\n"
     ]
    }
   ],
   "source": [
    "obs_sequence = np.array([[0, 1, 0]]).T\n",
    "alpha = forward_algo(obs_sequence)\n",
    "print(alpha)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Backward Algorithm\n",
    "Used to build the `alpha` table for an observation sequence.\n",
    "Initialization:\n",
    "- `beta`(`i`, `t=T`) = `1`\n",
    "\n",
    "Recursion:\n",
    "- `beta`(`i`,`t+1`) = SUM[over `state_j`][emission at `t+1` * transition `t`->`t+1` * `beta`(`j`,`t`)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def backward_algo(obs_seq):\n",
    "    beta = np.zeros((len(obs_seq), n_state))\n",
    "    beta[len(obs_seq) - 1] = np.ones((n_state))\n",
    "    for t in reversed(range(len(obs_seq)-1)):\n",
    "        beta[t] = np.dot(beta[t + 1] * np.transpose(emit_prob)[obs_seq[t + 1]], np.transpose(transmat_prob))\n",
    "    return beta"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.2592 0.1824]\n",
      " [0.56   0.72  ]\n",
      " [1.     1.    ]]\n"
     ]
    }
   ],
   "source": [
    "beta = backward_algo(obs_sequence)\n",
    "print(beta)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Marginals for observation sub-sequences\n",
    "Summing over the `t`-th `alpha` column gives the probability of observing the sub-sequence `obs_sequence`[`:, t`].\n",
    "\n",
    "- `sub_seq_marginals`[i] = p(`obs_1` ... `obs_i`) = sum over `k` of `alpha(`t`=`i`, `lane`=`k`)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.66666667 0.208      0.13184   ]\n"
     ]
    }
   ],
   "source": [
    "sub_seq_marginals = np.sum(alpha, axis=1)\n",
    "print(sub_seq_marginals)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Marginal of the full observation sequence\n",
    "The marginal probability of the observation sequence can be obtained\n",
    "by summing the product `alpha`*`beta` at any `t`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.13184 0.13184 0.13184]\n"
     ]
    }
   ],
   "source": [
    "prod = np.multiply(alpha, beta)\n",
    "marginals = np.sum(prod, axis=1)\n",
    "print(marginals)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Filtering\n",
    "Filtering can be obtained by normalizing the last `alpha` values (using the marginal probability of the full observation sequence)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.26213592 0.73786408]\n"
     ]
    }
   ],
   "source": [
    "print(alpha[-1]/marginals[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prediction\n",
    "The `pi` variable is introduced:\n",
    "- `pi`(`lane i`, `time t+k+1`) = SUM over `state` `j` of [transition `j`->`i` * `pi`(`lane i`, `time t+k`)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pi_algo(k, alpha, beta):\n",
    "    pi = np.zeros((k, n_state))\n",
    "    pi[0]=alpha[0]/marginals[0]\n",
    "    for t in range(pi.shape[0]-1): \n",
    "        pi[t+1] = np.dot(pi[t], transmat_prob)\n",
    "    return pi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1.01132686 4.04530744]\n",
      " [1.41585761 3.6407767 ]\n",
      " [1.5776699  3.4789644 ]\n",
      " [1.64239482 3.41423948]]\n"
     ]
    }
   ],
   "source": [
    "pi = pi_algo(4, alpha, beta)\n",
    "print(pi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Smoothing\n",
    "The `gamma` variable is introduced:\n",
    "- `gamma`(`k`, `t`) = p(`lane(t)`=`k` | `[observation sequence (1...T)]`)\n",
    "- for `t` in [`1` ... `T`]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "def gamma_algo(alpha, beta):\n",
    "    prod = np.multiply(alpha, beta)\n",
    "    marginals = np.sum(prod, axis=1)\n",
    "    gamma = np.divide(prod, marginals[:, np.newaxis])\n",
    "    return gamma"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.26213592 0.73786408]\n",
      " [0.47572816 0.52427184]\n",
      " [0.26213592 0.73786408]]\n"
     ]
    }
   ],
   "source": [
    "gamma = gamma_algo(alpha, beta)\n",
    "print(gamma)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [Q5](https://github.com/chauvinSimon/hmm_for_autonomous_driving/blob/master/README.md#q5) - Given a sequence of speed observations, what is the most likely underlying lane sequence?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Decoding\n",
    "The **inferred optimal hidden states** can be obtained by calling [`decode()`](https://hmmlearn.readthedocs.io/en/latest/api.html#hmmlearn.hmm.MultinomialHMM.decode) method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "right_lane -> low_speed\n",
      "right_lane -> high_speed\n",
      "right_lane -> low_speed\n",
      "\n",
      "prob = 0.05461333333333335\n",
      "log_prob = -2.9074772257991035\n"
     ]
    }
   ],
   "source": [
    "obs_sequence = np.array([[0, 1, 0]]).T\n",
    "# Find most likely state sequence corresponding to obs_sequence\n",
    "logprob, state_sequence = discrete_model.decode(obs_sequence)\n",
    "\n",
    "# Log probability of the produced state sequence\n",
    "for o, s in zip(obs_sequence.T[0], state_sequence):\n",
    "    print(\"{} -> {}\".format(states[int(s)], observations[int(o)]))\n",
    "print(\"\\nprob = {}\\nlog_prob = {}\".format(np.exp(logprob), logprob))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [Q6](https://github.com/chauvinSimon/hmm_for_autonomous_driving/blob/master/README.md#q6) - How to estimate the parameters of our HMM when no state annotations are present in the training data?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Unsupervised Learning\n",
    "**The HMM parameters** are estimated based on some observation data using the [`fit()`](https://hmmlearn.readthedocs.io/en/latest/api.html#hmmlearn.hmm.MultinomialHMM.fit) method. It implements the **Baum-Welch algorithm**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The `fit()` method is not \"const\". Therefore save the original parameters for comparison.\n",
    "old_transmat = discrete_model.transmat_\n",
    "old_emissionprob = discrete_model.emissionprob_\n",
    "old_startprob = discrete_model.startprob_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Random unbalanced sampling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "biased_sampling = \n",
      "[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1\n",
      " 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1\n",
      " 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0]\n"
     ]
    }
   ],
   "source": [
    "# On purpose, one could generate an unbalanced sampling.\n",
    "biased_sampling = np.random.choice([0, 1], size=(100,), p=[1./10, 9./10])\n",
    "print(\"biased_sampling = \\n{}\".format(biased_sampling))\n",
    "\n",
    "obs_sequence = np.array([biased_sampling]).T\n",
    "new_model = discrete_model.fit(obs_sequence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interpretation\n",
    "\n",
    "Due to **random sampling**, two consecutive observations are **independent**.\n",
    "- Therefore the **new transition matrix** is almost **uniform**.\n",
    "\n",
    "The significant **unbalance in the observation distribution** is captured by the **emission model**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---\n",
      "old_transmat = \n",
      "[[0.6 0.4]\n",
      " [0.2 0.8]]\n",
      "new_transmat = \n",
      "[[0.45266665 0.54733335]\n",
      " [0.43276709 0.56723291]]\n",
      "---\n",
      "old_emissionprob = \n",
      "[[0.4 0.6]\n",
      " [0.8 0.2]]\n",
      "new_emissionprob = \n",
      "[[0.19693203 0.80306797]\n",
      " [0.07730514 0.92269486]]\n",
      "---\n",
      "old_startprob = \n",
      "[0.33333333 0.66666667]\n",
      "new_startprob = \n",
      "[0.34208472 0.65791528]\n"
     ]
    }
   ],
   "source": [
    "print(\"---\\nold_transmat = \\n{}\".format(old_transmat))\n",
    "print(\"new_transmat = \\n{}\".format(new_model.transmat_))\n",
    "print(\"---\\nold_emissionprob = \\n{}\".format(old_emissionprob))\n",
    "print(\"new_emissionprob = \\n{}\".format(new_model.emissionprob_))\n",
    "print(\"---\\nold_startprob = \\n{}\".format(old_startprob))\n",
    "print(\"new_startprob = \\n{}\".format(new_model.startprob_))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Todo\n",
    "Here are some **ideas for future works**:\n",
    "- **Evaluate the trained model** on a **test set**.\n",
    "    - After training, it would be interesting to **assess the new model** by **submitting an observation drawn from the same distribution**.\n",
    "    - Similar to other supervised learning approaches. \n",
    "- Use **real data**.\n",
    "    - The [NGSIM US-101 highway dataset](https://catalog.data.gov/dataset/next-generation-simulation-ngsim-vehicle-trajectories) contains a collection of detailed **vehicle trajectory data** recorded on a **5-lane highway**.\n",
    "    - It could be interesting to **derive HMM parameters** based on this data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "draft",
   "language": "python",
   "name": "draft"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
